Linguistically Annotated Corpus as an Invaluable Resource for Advancements in Linguistic Research: A Case Study

Authors

  • Jan Hajič
  • Eva Hajičová
  • Jiří Mírovský
  • Jarmila Panevová
Abstract

This case study draws on experience in linguistic investigations using annotated monolingual and multilingual text corpora; the “cases” describe language phenomena belonging to different layers of the language system: morphology, surface and underlying syntax, and discourse. The analysis is based on the complex annotation of syntax, semantic functions, information structure and discourse relations in the Prague Dependency Treebank, a collection of annotated Czech texts. We want to demonstrate that corpus annotation is not a self-contained goal: in order to be consistent, it should be based on some linguistic theory, and, at the same time, it should serve as a test bed for the given linguistic theory in particular and for linguistic research in general.

Similar resources

CINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese

This paper presents a new linguistic resource for the study and computational processing of Portuguese. CINTIL DependencyBank PREMIUM is a corpus of Portuguese news text, accurately manually annotated with a wide range of linguistic information (morpho-syntax, named-entities, syntactic function and semantic roles), making it an invaluable resource especially for the development and evaluation of...

Linguistically Annotated Learner Corpora: Aspects of a Layered Linguistic Encoding and Standardized Representation

Linguistically annotated corpora that are stored in standardized digital form can be a valuable source of empirical insight. They can help verify linguistic generalizations and support the formulation of new hypotheses. The linguistic annotation of such corpora often is crucial for their effective exploration from a linguistic perspective. The annotation essentially serves as an index to the li...

Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts

Large-scale linguistically annotated resources have become available in recent years. This is partly due to sophisticated automatic and semiautomatic approaches that work well on specific tasks such as part-of-speech tagging. For more complex linguistic phenomena like anaphora resolution there are no tools that result in high-quality annotations without massive user intervention. Annotated corpo...

Repurposing Theoretical Linguistic Data for Tool Development and Search

For the majority of the world’s languages, the number of linguistic resources (e.g., annotated corpora and parallel data) is very limited. Consequently, supervised methods, as well as many unsupervised methods, cannot be applied directly, leaving these languages largely untouched and unnoticed. In this paper, we describe the construction of a resource that taps the large body of linguistically ...

Rethinking The Corpus: Moving towards Dynamic Linguistic Resources

The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, despite dubbing these annotations as "gold-standard", many corpora contain labeling error...

Publication date: 2016